Assignment 1

1

In our first plot, linoleic is a continous variable with colored in hue’s of blue. The channel capacity for hue is only 10 levels and the problem of occusion with the data also deteriorates our understanding of the plot. In the second plot, We have linoleic variable segmented into 4 groups, this gives us a quicker understanding of the data showing us the relative values for the groups. The perception problem of relative judgement is affected as color hue comes with the highest error in human beings.

ol = read.csv("olive.csv", header = T, row.names = 1)
head(ol, 10)
##    Region         Area palmitic palmitoleic stearic oleic linoleic
## 1       1 North-Apulia     1075          75     226  7823      672
## 2       1 North-Apulia     1088          73     224  7709      781
## 3       1 North-Apulia      911          54     246  8113      549
## 4       1 North-Apulia      966          57     240  7952      619
## 5       1 North-Apulia     1051          67     259  7771      672
## 6       1 North-Apulia      911          49     268  7924      678
## 7       1 North-Apulia      922          66     264  7990      618
## 8       1 North-Apulia     1100          61     235  7728      734
## 9       1 North-Apulia     1082          60     239  7745      709
## 10      1 North-Apulia     1037          55     213  7944      633
##    linolenic arachidic eicosenoic
## 1         36        60         29
## 2         31        61         29
## 3         31        63         29
## 4         50        78         35
## 5         50        80         46
## 6         51        70         44
## 7         49        56         29
## 8         39        64         35
## 9         46        83         33
## 10        26        52         30
ggplot(ol, aes(palmitic, oleic, col = linolenic)) + geom_point()

disc = cut_interval(ol$linolenic, 4)
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()

2

The 2nd plot is the easiest to analyse the plot with linolenic segmented into 4 groups. The size mapping creates the problem of occlusion due to overlapping. The orientaion angle map does not help either as the scatter plot many observations creates a high relative judement error. a)With Color hue 10 levels of feature can be percieved and 3.1bits can be decoded,With Color Brightness 5 levels and 2.1bits can be decoded. b)With size of object 4-5levels of feature can be percieved depending on human subject’s individualistic abilities, and 2.2bits can be decoded for this aesthetic. c)line orientation : 3bits can be decoded for this feature.

ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()

ggplot(ol, aes(palmitic, oleic, size = disc)) + geom_point()
## Warning: Using size for a discrete variable is not advised.

ggplot(ol, aes(palmitic, oleic)) + geom_point() + 
    geom_spoke(angle = ol$linolenic, radius = 40)

3

Treisman’s theory of preattentive processing is showcased in this example. With no segmentation done by Region, we do not easily identify the boundary, but with the second plot we see the same much quickly due to preattentive preprocessing of contrast and luminance.

ggplot(ol, aes(oleic, eicosenoic, col = Region)) + geom_point()

ggplot(ol, aes(oleic, eicosenoic, col = cut_interval(Region,3))) + geom_point()

4

The 3 colors are each mapped with contrast and size and these feature maps are parallely processed in our brain as this creates a problem of preattentive preprocessing while analysing 27 different types of observation as we tend to wrongly group the data in our perception

ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = cut_interval(ol$linoleic, 3),
              shape = cut_interval(ol$palmitic, 3), 
              size = cut_interval(ol$palmitoleic, 3))) + geom_point()
## Warning: Using size for a discrete variable is not advised.

5

Size, contrast and shape are individual feature maps that are linked to different colors and hence preattentive preprocessing helps in this case.

ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = ol$Region,
              shape = cut_interval(ol$palmitic, 3), 
              size = cut_interval(ol$palmitoleic, 3))) + geom_point()
## Warning: Using size for a discrete variable is not advised.

6

Relative Judgement due to area is very high due to the plot made as a pie size as the dominant group of South- Apulia looks much larger than the other groups.

p <- plot_ly(ol, labels = ~Area, type = 'pie', showlegend = FALSE) %>%
  layout(title = 'Pie Chart Area')
p

7

It is hard to look for outliers in the contour plot compared to the scatter plot. In the contour plot it shows we have 5 peak values but you wont be able to spot any difference for it in the scatter plot. The extreme values are not plotted in the contour plot. It is also hard to figure out clusters in the contour plot compared to a scatter plot which is a big issue in this plot.

ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_density2d()

ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_point()

Assignment 2

1 - Read Data

The columns vary a lot in the range. Some values like BAvg are in a range of 0.235 to 0.282 while values like TB are in the range 2090 to 2615. But when we scale and apply MDS on the scaled values the stress level increases. The goodness of fit decreases. This is why I think we should not scale the values before applying MDS on it.

bball = read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1", header = TRUE,
                  row.names = 1)
head(bball)
##                     League Won Lost Runs.per.game HR.per.game   AB Runs
## Aizona Diamondbacks     NL  69   93          4.64    1.172840 5665  752
## Atlanta Braves          NL  68   93          4.03    0.757764 5514  649
## Baltimore Orioles       AL  89   73          4.59    1.561728 5524  744
## Boston Red Sox          AL  93   69          5.42    1.283951 5670  878
## Chicago Cubs            NL 103   58          4.99    1.236025 5503  808
## Chicago White Sox       AL  78   84          4.23    1.037037 5550  686
##                     Hits X2B X3B  HR RBI StolenB CaughtS  BB   SO  BAvg
## Aizona Diamondbacks 1479 285  56 190 709     137      31 463 1427 0.261
## Atlanta Braves      1404 295  27 122 615      75      34 502 1240 0.255
## Baltimore Orioles   1413 265   6 253 710      19      13 468 1324 0.256
## Boston Red Sox      1598 343  25 208 836      83      24 558 1160 0.282
## Chicago Cubs        1409 293  30 199 767      66      34 656 1339 0.256
## Chicago White Sox   1428 277  33 168 656      77      36 455 1285 0.257
##                       OBP   SLG   OPS   TB GDP HBP SH SF IBB  LOB
## Aizona Diamondbacks 0.320 0.432 0.752 2446 117  50 43 38  43 1113
## Atlanta Braves      0.321 0.384 0.705 2119 145  59 64 52  60 1161
## Baltimore Orioles   0.317 0.443 0.760 2449 119  44 17 36  19 1065
## Boston Red Sox      0.348 0.461 0.810 2615 137  43  8 40  34 1162
## Chicago Cubs        0.343 0.429 0.772 2359 107  96 42 37  45 1217
## Chicago White Sox   0.317 0.410 0.727 2275 122  53 29 44  16 1105

2 - Non-metric MDS

It is hard to see a difference between the legues in this plot. The points are well spread out so it is hard to tell if a MDS component is providing better differentiation between the leagues. I felt V1 was doing a better split between the leagues compared to V2. According to this plot “Los Angeles Angels”, “Boston Red Sox” and “Colorado Rocies” look like outliers.

bball.numeric = bball[,3:27]
distance = dist(bball.numeric)
res = isoMDS(distance, k=2, p=2)
## initial  value 12.033362 
## final  value 12.032500 
## converged
coords = res$points
coordsMDS = as.data.frame(coords)
coordsMDS$name = rownames(coordsMDS)
coordsMDS$league = bball$League
plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter", mode = "markers"
        , hovertext=~name, color= ~league)

3 - Shepard Plot

MDS was able to decress the stress value upto 12%. Given that the dataset was 26 dimension and getting it down to 2 dimensions, with stress level of 2 is good. Some of the observation pairs was hard for MDS to map, and it was almost always including Chicago clubs. Some of the pairs that were hard to map are Chicago clubs and Arizona Diamondbacks, Chicago clubs and Baltimore Orioles, chicago clubs and Kansas city royals.

sh <- Shepard(distance, coords)
delta <-as.numeric(distance)
D<- as.numeric(dist(coords))

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])

plot_ly()%>%
  add_markers(x=~delta, y=~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(bball)[index1],
                            '<br> Obj 2: ', rownames(bball)[index2]))%>%
  add_lines(x=~sh$x, y=~sh$yf)

4 -

Since V1 was spliting the leagues better, I plotted all the variables against it and found that TB(Total Bases) and OPS(On Base plus slugging) had the maximum positive connection between them. I have included the plots for the two. Both TB and OPS showed very strong positive connection. These two variables are very important in scoring the baseball teams.

plots_bball$P20 #OPS  **

plots_bball$P21 #TB  **

Appendix

library(ggplot2)
library(plotly)
library(xlsx)
library(MASS)
library(gridExtra)
#Assignment 1
#Q1
ol = read.csv("olive.csv", header = T, row.names = 1)
head(ol, 10)

ggplot(ol, aes(palmitic, oleic, col = linolenic)) + geom_point()

disc = cut_interval(ol$linolenic, 4)
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()


#Q2
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
ggplot(ol, aes(palmitic, oleic, size = disc)) + geom_point()
ggplot(ol, aes(palmitic, oleic)) + geom_point() + 
    geom_spoke(angle = ol$linolenic, radius = 40)

#Q3
ggplot(ol, aes(oleic, eicosenoic, col = Region)) + geom_point()


#Q4
ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = cut_interval(ol$linoleic, 3),
              shape = cut_interval(ol$palmitic, 3), 
              size = cut_interval(ol$palmitoleic, 3))) + geom_point()


#Q5
ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = ol$Region,
              shape = cut_interval(ol$palmitic, 3), 
              size = cut_interval(ol$palmitoleic, 3))) + geom_point()

#Q6
p <- plot_ly(ol, labels = ~Area, type = 'pie', showlegend = FALSE) %>%
  layout(title = 'Pie Chart Area')
p

#Q7
ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_density2d()
ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_point()


#Assignment 2
#Q1

bball = read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1", header = TRUE,
                  row.names = 1)
head(bball)

#Q2
bball.numeric = bball[,3:27]
distance = dist(bball.numeric)
res = isoMDS(distance, k=2, p=2)
coords = res$points

coordsMDS = as.data.frame(coords)
coordsMDS$name = rownames(coordsMDS)
coordsMDS$league = bball$League
plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter", mode = "markers"
        , hovertext=~name, color= ~league)


#Q3
sh <- Shepard(distance, coords)
delta <-as.numeric(distance)
D<- as.numeric(dist(coords))

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])

plot_ly()%>%
  add_markers(x=~delta, y=~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(bball)[index1],
                            '<br> Obj 2: ', rownames(bball)[index2]))%>%
  add_lines(x=~sh$x, y=~sh$yf)

#Q4
bball$V1 = coordsMDS$V1
bball$V2 = coordsMDS$V2

cols_bball = colnames(bball)
dim(bball)[2]
plots_bball = list()
for(i in 2:27){
  pl_name = paste("P", i, sep = '')
  col_name = cols_bball[i]
  plots_bball[[pl_name]] = ggplot(bball, aes_string("V1", col_name)) + 
    geom_point() + geom_line(aes(x=0))
}
grid.arrange(grobs = plots_bball, ncol = 6, nrow = 6)

plots_bball$P2
plots_bball$P3
plots_bball$P4 #Runs per game
plots_bball$P5
plots_bball$P6 #AB
plots_bball$P7 #Runs **
plots_bball$P8 #Hits **
plots_bball$P9
plots_bball$P10
plots_bball$P11
plots_bball$P12 #RBI **
plots_bball$P13
plots_bball$P14
plots_bball$P15
plots_bball$P16
plots_bball$P17 #BAvg
plots_bball$P18
plots_bball$P19 #SLG
plots_bball$P20 #OPS  **
plots_bball$P21 #TB  **
plots_bball$P22
plots_bball$P23
plots_bball$P24
plots_bball$P25
plots_bball$P26
plots_bball$P27

plots_bball2 = list()
for(i in 2:27){
  pl_name = paste("P", i, sep = '')
  col_name = cols_bball[i]
  plots_bball2[[pl_name]] = ggplot(bball, aes_string("V2", col_name)) + 
    geom_point() + geom_line(aes(x=0))
}
grid.arrange(grobs = plots_bball2, ncol = 6, nrow = 6)
plots_bball2$P2
plots_bball2$P3
plots_bball2$P4
plots_bball2$P5
plots_bball2$P6
plots_bball2$P7
plots_bball2$P8
plots_bball2$P9
plots_bball2$P10
plots_bball2$P11
plots_bball2$P12
plots_bball2$P13
plots_bball2$P14
plots_bball2$P15
plots_bball2$P16
plots_bball2$P17
plots_bball2$P18
plots_bball2$P19
plots_bball2$P20
plots_bball2$P21
plots_bball2$P22
plots_bball2$P23
plots_bball2$P24
plots_bball2$P25
plots_bball2$P26
plots_bball2$P27